Tranposon insertion sequencing (Tn-seq) is a robust and sensitive method to tackle the discovery of quantitative genetic interactions in with transposons microorganisms. This techniques combines massive parallel sequencing with transposon insertional mutagenesis. Transposons are DNA sequences that change their position within the genome. In order to carry out Tn-seq, a transposon insertion library containing a group of mutants that collectively have transposon insertions in all non-essential genes is created. The library is then grown under an experimental condition of interest. Mutants with transposons inserted in genes required for growth under the test condition will diminish in frequency from the population. Genomic sequences adjacent to the transposon ends are amplified by PCR and sequenced by massively parallel sequencing to determine the location and abundance of each insertion mutation. The importance of each gene for growth under the test condition is determined by comparing the abundance of each mutant before and after growth under the condition being examined. If an inserted mutation has negative effect on bacterial growth then a lower number of mapping reads will be found for that gene compared to the non-mutatated gene, the opposite occurs for insertations that positevily affect bacterial growth.
This is project is heavily based on the paper Tn-seq: high-throughput parallel sequencing for fitness and genetic interaction studies in microorganisms by van Opijnen, Bodi, and Camilli. They focused their work on a S. pneumoniae culture was transformed with the magellan6 transposon. In this paper they calculated a fitness score for each gene to determine the effect of the inserted mutation. This score is computed by first counting the number of mapping reads at each transposon insertion location for two different time points and then by by comparing the fold expansion of the mutant relative to the rest of the population with the following equation. The equation used is shown below.
Fitness Score
Given the fitness score for each insertion, average fitness score per gene was then calculated by averaging over all insertions in a specific gene. Based on this average score, genes were divided into four categories:
library(dplyr)
library(patchwork)
library(SingleR)
library(factoextra)
library(kableExtra)
library(ggplot2)
library("gridExtra")The plots below display the number of genes belonging to each category as well as the fitness value for each gene against the gene coordinate defined as the middle point between the start and the end of each gene.
fData_old <- read.delim("Tn_seq_fitness_data_Opijnen_et_al_2009.txt",
header=TRUE, stringsAsFactors = FALSE, sep = "\t")
# locus (gene locus) average_fitness sd sem N_insertions
# gene coordinates, gene locus + coordinates
geneCoord <- read.delim("GCF_000006885.1_ASM688v1_genomic_olt.txt",
header=FALSE, stringsAsFactors = FALSE, sep = "\t")
matches <- match(fData_old$locus,geneCoord$V1)
# 1813 matches in total
fData <- fData_old[!is.na(matches),]
essential <- length(fData$locus[fData$average_fitness==0])
# genes with fitness equal to 0 are considered essential genes
# transposons insertions occur only for non-essential genes
matches <- matches[!is.na(matches)]
geneCoord <- geneCoord[matches,]
genome_size <- max(geneCoord$V3)
fitness_df <- data.frame(x=(geneCoord$V2[fData$average_fitness!=0] +
geneCoord$V3[fData$average_fitness!=0])/2,
y=fData$average_fitness[fData$average_fitness!=0])
colnames(fitness_df) <- c('gene_coordinates', 'avg_fit_per_gene')
# according to the thresholds used in the articles,
# each gene is assigned to a insertion_type class depending on its avg fitness value
fitness_df$insertion_type <- cut(fitness_df$avg_fit_per_gene,
c(0,0.96,1.04,Inf),
c('Disadvantage','Neutral','Advantage'))
insertion_type <- as.data.frame(table(fitness_df$insertion_type))
levels(insertion_type$Var1) <- c(levels(insertion_type$Var1), 'essential')
insertion_type <- rbind(insertion_type,list('essential',essential))
colnames(insertion_type) <- c('group','value')
# pie chart for insertion types
ggplot(insertion_type, aes(x="",y=value,fill=group)) +
geom_bar(stat="identity", width=1, color="white") +
coord_polar("y", start=0) + theme_void() +
geom_text(aes(label=value), position = position_stack(vjust = 0.5)) +
scale_fill_manual(name='Insertion Type',
labels=c('Disadvantage','Neutral','Advantage','Essential'),
values =c("#CC6666", "#9999CC", "#66CC99",'#FF9966')) #Plot smile shape with insertion type
ggplot(fitness_df, aes(x=gene_coordinates, y=avg_fit_per_gene)) +
geom_point(aes(color=insertion_type), size=1) +
geom_smooth(method = 'loess',color='#990000') +
scale_color_manual(name= 'Insertion Type',
values = c("#CC6666", "#9999CC", "#66CC99")) +
labs(x = "Genomic Coordinates", y = "Avg Gene Fitness",
title= "Fitness vs Genome Location") By looking at the second plot, it can be noticed that the fitness curve as a sort of ‘smile’ shape depicting a trend where genes lying close to the origin of replication shows higher fitness score than genes farther away. This is somehow expected due to the circular shape of bacterial genome. In fact, genome regions proximal to the origin of replication are present on average in a higher number of copies than distal ones. In this project we tried to tackle this problem by finding an appropriate pipeline to correct the average fitness scores.
The gene coordinates provided are linearized, though bacterial genome is circular therefore it should be preferred to work with circular coordinates instead of linear ones. For this matter, we transformed linear gene coordinates into radians. Moreover, genes at the two extremities have the same distance to the ORI, therefore radians coordinates have been rescaled so all coordinates range between 0 and \(\pi\) regardless of their location on the genome. In this way, instead of correcting the bias using one half of the genome, we can correct the bias by using the two halves of the genome.
#Plot smile shape
ggplot(fitness_df, aes(gene_coordinates,avg_fit_per_gene)) +
geom_point(size = 1) + geom_smooth(method = 'loess')+
labs(x = "Genomic Coordinates", y = "Avg gene fitness",
title= "Fitness vs Genome Location")# transforming gene coordinates into radians
fitness_df$radians <- (fitness_df$gene_coordinates/genome_size)*2*pi
# rescaling radians in order to conider distance from origin instead of genomic location
fitness_df$scaled_radians <- ifelse(fitness_df$radians>pi,
fitness_df$radians-2*pi, fitness_df$radians)
fitness_df$scaled_radians <- abs(fitness_df$scaled_radians)
# Bias using radians coordianetes
ggplot(fitness_df, aes(x=scaled_radians, y=avg_fit_per_gene)) +
geom_point(size = 1) + geom_smooth(method = 'lm') +
labs(x = "Scaled Radians", y = "Avg Gene Fitness",
title= "Fitness vs Genome Location") As shown in the plot, now fitness score displays a linear trend with a slightly negative slope.
We tried to remove the linear trend to see if this was enough to properly account for the bias introduced by the circularity of the genome. We first used a train-test approach. We randomly sampled half of the the genes and used those for training the linear model, and then tested the model on the remaining test genes. More precisely, we fitted the linear model on training data, used the model to predict the average fitness for the test genes, and then corrected the observed test gene fitness scores by removing the predicted scores and adding back the average test fitness score.
set.seed(42)
train_size <- ceiling((nrow(fitness_df)/2))
train <- sample(c(1:nrow(fitness_df)), size = train_size)
train_data <- fitness_df[train,]
test_data <- fitness_df[-train,]
model_train <- lm(avg_fit_per_gene ~ scaled_radians, train_data)
summary(model_train)
Call:
lm(formula = avg_fit_per_gene ~ scaled_radians, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-0.282911 -0.011063 0.006009 0.018954 0.243152
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.030184 0.003258 316.192 <2e-16 ***
scaled_radians -0.015776 0.001764 -8.943 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.04606 on 784 degrees of freedom
Multiple R-squared: 0.09256, Adjusted R-squared: 0.0914
F-statistic: 79.97 on 1 and 784 DF, p-value: < 2.2e-16
test_data$fitted_fitness <- test_data$avg_fit_per_gene -
predict(model_train, test_data) + mean(test_data$avg_fit_per_gene)
ggplot(test_data, aes(x=gene_coordinates, y=fitted_fitness)) +
geom_point(size=0.5) +
geom_smooth(method = 'lm', color='#990000') +
scale_color_manual(name= 'Insertion Type') +
labs(x = "Genomic Coordinates", y = "Avg Gene Fitness",
title= "Corrected Fitness vs Genome Location - Test")
Call:
lm(formula = fitted_fitness ~ gene_coordinates, data = test_data)
Residuals:
Min 1Q Median 3Q Max
-0.48876 -0.00928 0.01046 0.02400 0.23613
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.938e-01 4.160e-03 238.877 <2e-16 ***
gene_coordinates 1.803e-09 3.290e-09 0.548 0.584
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.05704 on 783 degrees of freedom
Multiple R-squared: 0.0003835, Adjusted R-squared: -0.0008932
F-statistic: 0.3004 on 1 and 783 DF, p-value: 0.5838
The fitness score displays a constant trend indicating that the bias was accounted for. We further tested this by fitting a linear model with the correct fitness score as the response variable and gene coordinates as the explanatory variable. The p-value for slope term is quite large ( > 0.05), indicating there is no significant slope.
Given this result, we applied the same correction procedure, using the entire dataset to fit the linear model this time.
Call:
lm(formula = avg_fit_per_gene ~ scaled_radians, data = fitness_df)
Residuals:
Min 1Q Median 3Q Max
-0.49342 -0.00952 0.00841 0.02161 0.24572
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.028171 0.002626 391.54 <2e-16 ***
scaled_radians -0.016010 0.001436 -11.15 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.05187 on 1569 degrees of freedom
Multiple R-squared: 0.07345, Adjusted R-squared: 0.07286
F-statistic: 124.4 on 1 and 1569 DF, p-value: < 2.2e-16
# fitted line
ggplot(fitness_df, aes(x=scaled_radians,y=avg_fit_per_gene)) +
geom_point(aes(color=insertion_type), size=0.5) +
geom_smooth(method = 'lm',color='#990000')+
scale_color_manual(name= 'Insertion Type',
values = c("#CC6666", "#9999CC", "#66CC99")) +
labs(x = "Scaled Radians", y = "Avg Gene Fitness",
title= "Fitness vs Genome Location") fitness_new_predict <- predict(model_tot, newdata=fitness_df)
fitness_df$pred_fitness <- fitness_df$avg_fit_per_gene -
fitness_new_predict + mean(fitness_df$avg_fit_per_gene)
fitness_df$new_int_type <- cut(fitness_df$pred_fitness,
c(0,0.96,1.04,Inf),
c('Disadvantage','Neutral','Advantage'))
# Corrected model radians
ggplot(fitness_df,aes(x=scaled_radians,y=pred_fitness)) +
geom_point(aes(color=new_int_type), size=0.5) +
geom_smooth(method = 'lm', color='#990000')+
scale_color_manual(name= 'Insertion Type',
values = c("#CC6666", "#9999CC", "#66CC99")) +
labs(x = "Scaled Radians", y = "Avg Gene Fitness",
title= " Corrected Fitness vs Genome Location") # Corrected model genomic locations
ggplot(fitness_df, aes(x=gene_coordinates,y=pred_fitness)) +
geom_point(aes(color=new_int_type),size=0.5) +
geom_smooth(method = 'lm', color='#990000')+
scale_color_manual(name= 'Insertion Type',
values = c("#CC6666", "#9999CC", "#66CC99")) +
labs(x = "Genomic Coordinates", y = "Avg Gene Fitness",
title= "Corrected Fitness vs Genome Location") # df for results
results <- as.data.frame(table(fitness_df$insertion_type))
results <- rbind(results,as.data.frame(table(fitness_df$new_int_type)))
correction <-factor(
c("no", "no", "no", "all points", "all points", "all points"),
levels = c('no', 'all points'))
results <- cbind(results, correction)
# barplot results
ggplot(results, aes(x=Var1, y=Freq, fill=correction)) +
geom_bar(stat="identity", position = "dodge") +
scale_fill_manual(name='Correction', values = c("#CC6666", "#9999CC"),
labels=c('No','All Points')) +
labs(x = "Insertion Type", y = "Counts") It is possible to see that after correcting the fitness values as proposed above, more genes fall within the neutral category.
We decided to use this same approach, though we considered windows of nucleotides and used the average point in the window (average radian for each window) as X in the linear model instead of all single genes, and the average fitness score per window as Y, where the average fitness value for the window is the average of all gene falling within the window. We decided to also try this method as a model fitted using all genes may be more sensitive to noise in the data. With this more robust approach we wanted to see if we could improve further the results. We tested different window sizes (10’000, 50’000, 100’000, 200’000).
result_window_size_df <- data.frame("Insertion Type"=rep(c('Disadvantage', 'Neutral','Advantage'),4),
'Count'= rep(0,12),
'Correction'=rep(c('window10k', 'window50k','window100k', 'window200k'),
each=3))
result_window_size_df$Insertion.Type <- factor(c('Disadvantage', 'Neutral','Advantage'),levels = c('Disadvantage', 'Neutral','Advantage'))
#result_window_size_df$Correction<- factor(c('window10k', 'window50k','window100k', 'window200k'),labels = c('window10k', 'window50k','window100k', 'window200k'))
window_sizes <- c(10000, 50000, 100000, 200000)
for (window_size in window_sizes) {
windows <- seq(0, genome_size, by=window_size)
windows <- c(windows, genome_size)
window <- cut(fitness_df$gene_coordinates, windows)
fitness_df$window <- window
head(fitness_df)
# df per window
avg_fit_per_window <- with(
fitness_df, by(fitness_df$avg_fit_per_gene, fitness_df$window, mean))
avg_fit_per_window <- sapply(avg_fit_per_window, mean)
mid_radian <- sapply(with(
fitness_df, by(fitness_df$scaled_radians, fitness_df$window, mean)),
mean)
fitness_window <- data.frame(window =levels(window),
avg_fitness=avg_fit_per_window)
fitness_window$mid_radian <- mid_radian
head(fitness_window)
fitness_window$gene_number <- table(fitness_df$window)
model_window <- lm(avg_fitness ~ mid_radian, fitness_window)
summary(model_window)
new_window_avg <- predict(model_window, fitness_window)
fitness_df$new_pred_window <- fitness_df$avg_fit_per_gene -
new_window_avg[fitness_df$window] + mean(fitness_df$avg_fit_per_gene)
fitness_df$new_int_type_w <- cut(fitness_df$new_pred_window,
c(0,0.96,1.04,Inf),
c('Disadvantage','Neutral','Advantage'))
table_count <- table(fitness_df$new_int_type)
index <- match(window_size, window_sizes)
shift <- (index - 1)*3
result_window_size_df[(shift+1):(shift+3),'Count'] = table_count
}
ggplot(result_window_size_df, aes(x=Insertion.Type, y=Count, fill=Correction)) +
geom_bar(stat="identity", position = "dodge") +
scale_fill_manual(name= 'Correction',
values = c("#CC6666", "#9999CC","#66CC99", "#FF9966"),
labels=c('Window 10K', 'Window 50k', 'Window 100k',
"Window 200k")) +
labs(x = "Insertion Type", y = "Counts") # summary table
#colnames(result_window_size_df) <- c('Insertion Type', 'Count', 'Correction')
summary_table <- xtabs(Count ~., result_window_size_df)
t(summary_table) %>%
kable(caption = 'Summary table',booktabs=TRUE)%>%
kable_styling(latex_options = "hold_position" ,font_size = 11)| Disadvantage | Neutral | Advantage | |
|---|---|---|---|
| window100k | 170 | 1235 | 166 |
| window10k | 170 | 1235 | 166 |
| window200k | 170 | 1235 | 166 |
| window50k | 170 | 1235 | 166 |
All different window sizes returned very similar results. We chose 100,000 as the optimal window size as it was a good trade-off between number of windows to fit and number of nucleotides per window.
# using windows
window_size <- 100000
windows <- seq(0, genome_size, by=window_size)
windows <- c(windows, genome_size)
window <- cut(fitness_df$gene_coordinates, windows)
fitness_df$window <- window
# df per window
avg_fit_per_window <- with(
fitness_df, by(fitness_df$avg_fit_per_gene, fitness_df$window, mean))
avg_fit_per_window <- sapply(avg_fit_per_window, mean)
mid_radian <- sapply(with(
fitness_df, by(fitness_df$scaled_radians, fitness_df$window, mean)),
mean)
fitness_window <- data.frame(window =levels(window),
avg_fitness=avg_fit_per_window)
fitness_window$mid_radian <- mid_radian
fitness_window$gene_number <- table(fitness_df$window)
# distribution of fitness values per windows
ggplot(fitness_df, aes(x=window, y=avg_fit_per_gene, group=window, fill = window)) +
geom_boxplot() +
theme(axis.title.y = element_text(size = 17),
axis.title.x = element_text(size = 17),
plot.title = element_text(size=22),
axis.text.x = element_blank()) +
xlab("Window") +
ylab("Fitness") +
ggtitle("Distribution of fitness values per window") +
ylim(0.7, 1.18) +
scale_fill_manual(values = rainbow(22), labels = seq(1,22))# fitness x radians with window
ggplot(fitness_window, aes(mid_radian, avg_fitness))+
geom_point(size = 1) + geom_smooth(method = 'loess') + labs(x = "Scaled Radians", y = "Avg Gene Fitness",
title= "Fitness vs Genome Location (window)")
Call:
lm(formula = avg_fitness ~ mid_radian, data = fitness_window)
Residuals:
Min 1Q Median 3Q Max
-0.012666 -0.006892 0.001237 0.005757 0.012309
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.028285 0.003203 321.040 < 2e-16 ***
mid_radian -0.016119 0.001779 -9.059 1.62e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.007649 on 20 degrees of freedom
Multiple R-squared: 0.8041, Adjusted R-squared: 0.7943
F-statistic: 82.07 on 1 and 20 DF, p-value: 1.622e-08
new_window_avg <- predict(model_window, fitness_window)
fitness_df$new_pred_window <- fitness_df$avg_fit_per_gene -
new_window_avg[fitness_df$window] + mean(fitness_df$avg_fit_per_gene)
fitness_df$new_int_type_w <- cut(fitness_df$new_pred_window,
c(0,0.96,1.04,Inf),
c('Disadvantage','Neutral','Advantage'))# corrected with window and insertion type (radians)
ggplot(fitness_df, aes(x=scaled_radians, y=new_pred_window))+
geom_point(aes(color=new_int_type_w), size=0.5) +
geom_smooth(method = 'lm', color='#990000') +
scale_color_manual(name= 'Insertion Type',
values = c("#CC6666", "#9999CC", "#66CC99")) +
labs(x = "Scaled Radians", y = "Avg Gene Fitness",
title= "Fitness vs Genome Location (window)") # corrected with window and insertion type (genomic location)
ggplot(fitness_df, aes(x=gene_coordinates, y=new_pred_window)) +
geom_point(aes(color=new_int_type_w), size=0.5) +
geom_smooth(method = 'lm', color='#990000') +
scale_color_manual(name= 'Insertion Type',
values = c("#CC6666", "#9999CC", "#66CC99")) +
labs(x = "Genomic Coordinates", y = "Avg Gene Fitness",
title= "Fitness vs Genome Location (window)") # df with results
correction <- c('window100k', 'window100k', 'window100k')
table_window <- cbind(as.data.frame(table(fitness_df$new_int_type_w)),
correction)
results <- rbind(results, table_window)
ggplot(results, aes(x=Var1, y=Freq, fill=correction)) +
geom_bar(stat="identity", position = "dodge") +
scale_fill_manual(name= 'Correction',
values = c("#CC6666", "#9999CC","#66CC99", "#FF9966"),
labels=c('No', 'All Points', 'Group', 'Window 100k')) +
labs(x = "Insertion Type", y = "Counts") As shown in the plot above, there is no much of a difference between the all-points approach and window approach as both return very similar results in terms of fitness of correction.
So far we have considered windows having the same length, though has the bar plot below shows, each window has a different number of genes. Therefore this time, instead of using windows of fixed length, we created windows of variable length but having fixed number of genes per window.
## df per gene per window
gene_window_df <- as.data.frame(table(fitness_df$window))
colnames(gene_window_df) <- c("window", "geneNumber")
ggplot(gene_window_df, aes(x=window, y=geneNumber, fill = "c", color="c")) +
geom_bar(stat="identity", alpha = 0.5, show.legend = FALSE) +
scale_fill_manual(values=c("#990800")) +
scale_color_manual(values = c("#990800")) +
theme(axis.title.y = element_blank(),
axis.title.x = element_text(size = 17),
plot.title = element_text(size=22)) +
xlab("Number of genes per window") +
ggtitle("Distribution of number of genes per window")[1] 71.40909
We decided that we wanted to have as many windows as the number of windows we had for the window approach presented earlier. Therefore we cretated 22 windows each spanning either 71 or 72 genes. Then we repated the same steps as in the previous cases.
# WINDOW WITH FIXED NUMBER OF GENES
slices <- unique(c(seq(0,923, 71), seq(923, 1571, 72)))
fitness_df$gene_group <- cut(1:nrow(fitness_df), c(slices),
as.character(seq(length(slices)-1)))
avg_fitness_per_group <- sapply(
split(fitness_df$avg_fit_per_gene,fitness_df$gene_group), mean)
mid_radian_per_group <- sapply(
split(fitness_df$scaled_radians,fitness_df$gene_group), mean)
fitness_group <- data.frame(levels(fitness_df$gene_group),
avg_fitness_per_group, mid_radian_per_group)
colnames(fitness_group) <- c('Group',
'avg_fitness',
'mid radian')
model_group<-lm(avg_fitness ~ `mid radian`, fitness_group)
summary(model_group)
Call:
lm(formula = avg_fitness ~ `mid radian`, data = fitness_group)
Residuals:
Min 1Q Median 3Q Max
-0.0152788 -0.0045553 0.0008201 0.0043868 0.0113487
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.028348 0.003405 302.028 < 2e-16 ***
`mid radian` -0.016126 0.001861 -8.664 3.33e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.007931 on 20 degrees of freedom
Multiple R-squared: 0.7896, Adjusted R-squared: 0.7791
F-statistic: 75.07 on 1 and 20 DF, p-value: 3.327e-08
new_group_avg <- predict(model_group, fitness_group)
fitness_df$new_pred_group <-fitness_df$avg_fit_per_gene -
new_group_avg[fitness_df$gene_group] +
mean(fitness_df$avg_fit_per_gene)
fitness_df$new_int_type_group<- cut(fitness_df$new_pred_group,
c(0, 0.96, 1.04, Inf),
c('Disadvantage', 'Neutral', 'Advantage'))ggplot(fitness_df,aes(x=scaled_radians, y=new_pred_group)) +
geom_point(aes(color=new_int_type_group), size=0.5) +
geom_smooth(method = 'lm',color='#990000')+
scale_color_manual(name= 'Insertion Type',
values = c("#CC6666", "#9999CC", "#66CC99")) +
labs(x = "Scaled Radians", y = "Avg Gene Fitness",
title= " Corrected Fitness vs Genome Location")ggplot(fitness_df,aes(x=gene_coordinates,y=new_pred_group)) +
geom_point(aes(color=new_int_type_group), size=0.5) +
geom_smooth(method = 'lm', color='#990000') +
scale_color_manual(name= 'Insertion Type',
values = c("#CC6666", "#9999CC", "#66CC99")) +
labs(x = "Genomic Coordinates", y = "Avg Gene Fitness",
title= "Corrected Fitness vs Genome Location")correction<-c('group', 'group', 'group')
table_window<-cbind(as.data.frame(table(fitness_df$new_int_type_group)),
correction)
results <- rbind(results,table_window)
ggplot(results, aes(x=Var1, y= Freq, fill=correction)) +
geom_bar(stat="identity", position = "dodge") +
scale_fill_manual(name= 'Correction',
values = c("#CC6666", "#9999CC", "#66CC99", '#FF9966'),
labels=c('No','All Points', "Window 100k", "Group")) +
labs(x = "Insertion Type", y = "Counts") colnames(results) <- c('Insertion Type', 'Count', 'Correction')
# build sumary table
summary_table2 <- xtabs(Count ~., results)
t(summary_table2) %>%
kable(caption = 'Summary table',booktabs=TRUE)%>%
kable_styling(latex_options = "hold_position" ,font_size = 11)| Disadvantage | Neutral | Advantage | |
|---|---|---|---|
| no | 180 | 1168 | 223 |
| all points | 170 | 1235 | 166 |
| window100k | 168 | 1238 | 165 |
| group | 165 | 1238 | 168 |
Overall all different approaches tested in this project returned comparable results in termed of ‘rescued’ genes, that is genes that are neutral, though due to the presence of the bias, they have higher average fitness values and wrongly classified as advantageous.
Until now, we have used the fact that for this genome we know the real ORI and the orientation of the genome. Now we suppose that we do not know the ORI and so we have to find it by exploiting the fact that the bias seems to be larger for genes close to the ORI.
In order to find the ORI and the TER, we will fit a local linear regression (LOESS), which is a method that allow us to fit models with any shape by fitting a linear regression for each subset in which we divide the dataset. After fitting this model we will find the maximum of this curve, which represent the ORI, and the minimun, which represent the TER.
First of all, we select a random coordinates and shift the genome with this, in order to simulate that we do not know the ORI, in this way we see that the U-shape do not appear, but however there are regions with higher average fitness.
new_fitness_df <- fitness_df[,1:4]
#Select random origin
set.seed(103)
random_start <- sample(c(1:genome_size), size = 1)
# Shift coordinates, in this way we simulate that we do not know the ORI
new_gene_coordinates <- new_fitness_df$gene_coordinates - random_start
new_fitness_df$new_gene_coordinates <- ifelse(new_gene_coordinates > 0,
new_gene_coordinates,
new_gene_coordinates + genome_size)
# We do not have the U shape because the genome is shifted, however we can see
#that there is the effect of the bias, because the curve have a hill
ggplot(new_fitness_df, aes(new_gene_coordinates,avg_fit_per_gene)) +
geom_point(size = 1) + geom_smooth(method = 'loess')+
labs(x = "Genomic coordinates", y = "Avg gene fitness",
title = "Shifted genome")Since we want to use LOESS (local linear regression) and this method do not allow us to be fitted for the extremity values of genomic coordinates, we use the fact that the genome is circular and extend the extremities.
top <- new_fitness_df[new_fitness_df$new_gene_coordinates < 150000, ]
top$new_gene_coordinates <- top$new_gene_coordinates + genome_size
bottom <- new_fitness_df[new_fitness_df$new_gene_coordinates > (genome_size - 150000), ]
bottom$new_gene_coordinates <- bottom$new_gene_coordinates - genome_size
new_data <- rbind(bottom, new_fitness_df, top)
ggplot(new_data, aes(new_gene_coordinates,avg_fit_per_gene)) +
geom_point(size = 1) + geom_smooth(method = 'loess')+
labs(x = "Genomic coordinates", y = "Avg gene fitness",
title = "Extended genome to account for LOESS limitation") +
geom_vline(xintercept = 0, linetype="dashed") +
geom_vline(xintercept = genome_size, linetype="dashed")Now we fit the LOESS. We use different span values, in order to understand if changing this parameter will affect the results in a significant way. The span parameter tell us how many neighbors we have to use for the local piece of the linear regression. For each fitted model then we find the maximum value and finally we check the difference between the results.
span_values <- c(0.25, 0.5, 0.75, 1.1)
oris <- c()
for (i in span_values) {
# fit the LOESS
model_loess <- loess(avg_fit_per_gene ~ new_gene_coordinates, data= new_data,
span = i,degree = 2)
summary(model_loess)
# find max, which represent the ORI
fitness_pred <-predict(model_loess, new_fitness_df)
max(fitness_pred)
fitness_pred[fitness_pred == max(fitness_pred)]
max_fit_index <- match(max(fitness_pred), fitness_pred)
max_fit_coord <- new_fitness_df$new_gene_coordinates[max_fit_index]
# refine the research near the found max
window_max <- seq((max_fit_coord - 100000), (max_fit_coord + 100000), 500)
window_max_df <- data.frame( 'new_gene_coordinates' = window_max)
# if coordinates does not fall in [0, genome_size] range we shift to ensure
# it falls within the genome interval
window_max_df$new_gene_coordinates <- ifelse(
window_max_df$new_gene_coordinates < 0,
window_max_df$new_gene_coordinates + genome_size,
window_max_df$new_gene_coordinates)
window_max_df$new_gene_coordinates <- ifelse(
window_max_df$new_gene_coordinates > genome_size,
window_max_df$new_gene_coordinates - genome_size,
window_max_df$new_gene_coordinates)
fitness_pred_max <- predict(model_loess, window_max_df)
fitness_pred_max[fitness_pred_max == max(fitness_pred_max)]
ori_index <- match(max(fitness_pred_max), fitness_pred_max)
ori_coords <- window_max_df[ori_index,]
oris <- c(oris, ori_coords)
}
# The values found using different span values are all near the real ORI, so
# this method works.
genome_size - random_start[1] 1697934
[1] 1587424 1665184 1675415 1853698
As we can see, the values of possible ORIs are close enough to the real ORI, indicating that the span value do not impact too much on the results. This can be because this genome have only one ORI and so the fitted curve is slightly non linear.
Now we choose one ORI e reshift the genome according to this, in order to check if we have th u-shape, that means that we found the correct ORI.
gene_coords_ori <- new_fitness_df$new_gene_coordinates - oris[3]
new_fitness_df$gene_coords_ori <- ifelse(gene_coords_ori > 0,
gene_coords_ori, gene_coords_ori + genome_size)
ggplot(new_fitness_df, aes(gene_coords_ori,avg_fit_per_gene)) +
geom_point(size = 1) + geom_smooth(method = 'loess')+
labs(x = "genomic coordinates", y = "avg gene fitness",
title = "U-shape found by shifting genome")Now we want to find the TER. In order to do this we follow the same steps made before for the ORI, so we fit a LOESS with the selected span value and we find the minimum value of this curve, which is the coordinate that have the less bias and so the genomic coordinate where the replication process end.
model_loess <- loess(avg_fit_per_gene ~ new_gene_coordinates, data= new_data,
span = 0.75, degree = 2)
fitness_pred <-predict(model_loess, new_fitness_df)
min_fit_index <- match(min(fitness_pred), fitness_pred)
min_fit_coord <- new_fitness_df$new_gene_coordinates[min_fit_index]
window_min <- seq((min_fit_coord - 100000), (min_fit_coord + 100000), 500)
window_min_df <- data.frame( 'new_gene_coordinates' = window_min)
window_min_df$new_gene_coordinates <- ifelse(
window_min_df$new_gene_coordinates < 0,
window_min_df$new_gene_coordinates + genome_size,
window_min_df$new_gene_coordinates)
window_min_df$new_gene_coordinates <- ifelse(
window_min_df$new_gene_coordinates > genome_size,
window_min_df$new_gene_coordinates - genome_size,
window_min_df$new_gene_coordinates)
fitness_pred_min <- predict(model_loess, window_min_df)
fitness_pred_min[fitness_pred_min == min(fitness_pred_min)] 198
0.9850598
ter_index <- match(min(fitness_pred_min), fitness_pred_min)
ter_coords <- window_min_df[ter_index,]
ggplot(new_fitness_df, aes(new_gene_coordinates,avg_fit_per_gene)) +
geom_point(size = 1) + geom_smooth(method = 'loess') +
labs(x = "Genomic coordinates", y = "Avg gene fitness",
title = "ORI and TER on shifted genomic coordinates") +
geom_point(aes(x=oris[3], y=max(fitness_pred)), colour="green", size = 3) +
geom_point(aes(x=ter_coords, y=min(fitness_pred)), colour="red", size = 3)As we can see from this plot, ORI and TER are correctly placed respectively on the region with higher and lower average fitness value.